
2.4.4 Distribution Rectification Distillation

Inner-level optimization. We first detail the maximization of self-information entropy. According to the definition of self-information entropy, $H(q^S)$ can be implicitly expanded as:
\[
H(q^S) = -\sum_{q_i^S \in q^S} p(q_i^S) \log p(q_i^S).
\tag{2.30}
\]
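As a purely illustrative sketch (not part of the Q-DETR implementation), the discrete form of Eq. (2.30) can be estimated from a histogram of the student query values; the function name and bin count below are hypothetical choices.
\begin{lstlisting}[language=Python]
import numpy as np

def histogram_entropy(q_s: np.ndarray, num_bins: int = 256) -> float:
    """Estimate the self-information entropy H(q^S) from a value histogram."""
    counts, _ = np.histogram(q_s.ravel(), bins=num_bins)
    p = counts / counts.sum()              # empirical probabilities p(q_i^S)
    p = p[p > 0]                           # drop empty bins so log is defined
    return float(-(p * np.log(p)).sum())   # H(q^S) = -sum_i p(q_i^S) log p(q_i^S)

# Example: 300 queries of dimension 256 drawn from a roughly Gaussian distribution.
q_s = np.random.randn(300, 256).astype(np.float32)
print(histogram_entropy(q_s))
\end{lstlisting}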

However, an explicit form of $H(q^S)$ can only be parameterized with a regular distribution $p(q_i^S)$. Luckily, the statistical results in Fig. 2.8 show that the query distribution tends to follow a Gaussian distribution, as also observed in [136]. This enables us to solve the inner-level optimization in a distribution-alignment fashion. To this end, we first calculate the mean $\mu(q^S)$ and variance $\sigma(q^S)$ of the query $q^S$, whose distribution is then modeled as $q^S \sim \mathcal{N}(\mu(q^S), \sigma(q^S))$. The self-information entropy of the student query can then be written as:

\[
\begin{aligned}
H(q^S) &= -\mathbb{E}\big[\log \mathcal{N}(\mu(q^S), \sigma(q^S))\big] \\
&= -\mathbb{E}\Big[\log\Big[\big(2\pi\sigma(q^S)^2\big)^{-\frac{1}{2}} \exp\Big(-\tfrac{(q_i^S - \mu(q^S))^2}{2\sigma(q^S)^2}\Big)\Big]\Big] \\
&= \tfrac{1}{2}\log 2\pi e\,\sigma(q^S)^2.
\end{aligned}
\tag{2.31}
\]

The above objective reaches its maximum of $H(q^S) = \frac{1}{2}\log 2\pi e\,[\sigma(q^S)^2 + \epsilon_{q^S}]$ when $\hat{q}^S = [q^S - \mu(q^S)]/\sqrt{\sigma(q^S)^2 + \epsilon_{q^S}}$, where $\epsilon_{q^S} = 10^{-5}$ is a small constant added to prevent a zero denominator. The mean and variance might be inaccurate in practice due to query data bias. To solve this, we borrow the concepts of batch normalization (BN) [207, 102]: a learnable shifting parameter $\beta_{q^S}$ is added to move the mean value, and a learnable scaling parameter $\gamma_{q^S}$ is multiplied to move the query to an adaptive position. In this situation, we rectify the information entropy of the query in the student as follows:

\[
\hat{q}^S = \frac{q^S - \mu(q^S)}{\sqrt{\sigma(q^S)^2 + \epsilon_{q^S}}}\,\gamma_{q^S} + \beta_{q^S},
\tag{2.32}
\]

in which case the maximum self-information entropy of student query becomes H(qS) =

(1/2) log 2πe[(σ2

qS + ϵqS)2

qS]. Therefore, in the forward propagation, we can obtain the

current optimal query qSvia Eq. (2.32), after which, the upper-level optimization is further

executed as detailed in the following contents.
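For concreteness, the inner-level rectification of Eqs. (2.31)-(2.32) can be sketched as a small PyTorch module. This is a hedged re-implementation under the assumptions stated above (per-dimension statistics over the query set, $\epsilon_{q^S} = 10^{-5}$), not the authors' released code; the names QueryRectifier and gaussian_entropy are hypothetical.
\begin{lstlisting}[language=Python]
import math
import torch
import torch.nn as nn

class QueryRectifier(nn.Module):
    """Sketch of Eq. (2.32): rectify student queries with learnable scale and shift."""

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(dim))   # learnable scaling gamma_{q^S}
        self.beta = nn.Parameter(torch.zeros(dim))   # learnable shifting beta_{q^S}

    def forward(self, q_s: torch.Tensor) -> torch.Tensor:
        # q_s: (num_queries, dim) student queries
        mu = q_s.mean(dim=0, keepdim=True)                   # mu(q^S)
        var = q_s.var(dim=0, unbiased=False, keepdim=True)   # sigma(q^S)^2
        q_hat = (q_s - mu) / torch.sqrt(var + self.eps)      # normalize the query
        return q_hat * self.gamma + self.beta                # Eq. (2.32)

def gaussian_entropy(var: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Closed-form Gaussian entropy 0.5 * log(2*pi*e*(sigma^2 + eps)), cf. Eq. (2.31)."""
    return 0.5 * torch.log(2 * math.pi * math.e * (var + eps))

# Usage: rectify 300 decoder queries of dimension 256 in the forward pass.
rectifier = QueryRectifier(dim=256)
q_hat = rectifier(torch.randn(300, 256))
\end{lstlisting}
With $\gamma_{q^S}$ initialized to one and $\beta_{q^S}$ to zero, the module starts as plain standardization and learns to shift the query statistics during distillation.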

Upper-level optimization. We continue with the minimization of the conditional information entropy between the student and the teacher. Following DETR [31], we denote the ground-truth labels by $y^{GT} = \{c_i^{GT}, b_i^{GT}\}_{i=1}^{N_{gt}}$, a set of ground-truth objects in which $N_{gt}$ is the number of foregrounds, and $c_i^{GT}$ and $b_i^{GT}$ respectively represent the class and coordinates (bounding box) of the $i$-th object. In DETR, each query is associated with an object. Therefore, we can obtain $N$ objects for the teacher and the student as well, denoted as $y^S = \{c_j^S, b_j^S\}_{j=1}^{N}$ and $y^T = \{c_j^T, b_j^T\}_{j=1}^{N}$.
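For illustration only, the label and prediction sets above can be held in a simple container; the field names and shapes (91 classes, 300 queries) are assumptions rather than fixed parts of the method.
\begin{lstlisting}[language=Python]
from dataclasses import dataclass
import torch

@dataclass
class ObjectSet:
    """A set of objects {(c_i, b_i)}: classes plus bounding boxes."""
    classes: torch.Tensor  # (N,) ground-truth labels or (N, num_classes) logits
    boxes: torch.Tensor    # (N, 4) normalized bounding-box coordinates

# Ground truth with N_gt foreground objects, and N query-wise predictions
# from the student and the teacher (one object per query, as in DETR).
y_gt = ObjectSet(classes=torch.randint(0, 91, (5,)), boxes=torch.rand(5, 4))
y_s = ObjectSet(classes=torch.randn(300, 91), boxes=torch.rand(300, 4))
y_t = ObjectSet(classes=torch.randn(300, 91), boxes=torch.rand(300, 4))
\end{lstlisting}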

The minimization of the conditional information entropy requires the student and teacher objects to be in a one-to-one matching. However, this is problematic for DETR, due primarily to the sparsity of the prediction results and the instability of the query predictions [129]. To solve this, we propose a foreground-aware query matching that rectifies “well-matched” queries. Concretely, we match the ground-truth bounding boxes against the student predictions to find the maximum coincidence as:

\[
G_i = \max_{1 \le j \le N} \mathrm{GIoU}\big(b_i^{GT}, b_j^S\big),
\tag{2.33}
\]
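The matching of Eq. (2.33) can be sketched with torchvision's pairwise GIoU, assuming the boxes have been converted to (x1, y1, x2, y2) format; the function name foreground_coincidence is hypothetical.
\begin{lstlisting}[language=Python]
import torch
from torchvision.ops import generalized_box_iou

def foreground_coincidence(b_gt: torch.Tensor, b_s: torch.Tensor):
    """Eq. (2.33): for each ground-truth box, the best GIoU over all student boxes.

    b_gt: (N_gt, 4) and b_s: (N, 4), both in (x1, y1, x2, y2) format.
    Returns G_i (max GIoU per ground-truth box) and the index of the matched query.
    """
    giou = generalized_box_iou(b_gt, b_s)   # (N_gt, N) pairwise GIoU matrix
    return giou.max(dim=1)                  # max over 1 <= j <= N

# Usage with dummy boxes (for illustration only).
b_gt = torch.tensor([[0.1, 0.1, 0.4, 0.5], [0.5, 0.5, 0.9, 0.9]])
xy1 = torch.rand(300, 2) * 0.5
b_s = torch.cat([xy1, xy1 + torch.rand(300, 2) * 0.5 + 0.01], dim=1)
G, matched_idx = foreground_coincidence(b_gt, b_s)
\end{lstlisting}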